Automatic extraction of syntactic patterns for dependency parsing in NP chunks

Mihaela Colhon
Department of Computer Science
University of Craiova
mcolhon@inf.ucv.ro

Dan Cristea
Faculty of Computer Science, „Alexandru Ioan Cuza” University of Iaşi
Institute for Computer Science, Romanian Academy, the Iaşi branch
dcristea@info.uaic.ro

Abstract

In this article we present a method for the automatic extraction of syntactic patterns that are used to develop a dependency parsing method. The patterns have been extracted from a corpus that has been automatically annotated for tokens, sentence boundaries, parts of speech and noun phrases, and manually annotated for dependency relations between words. The evaluation shows promising results for a language with free word order.

1 Introduction

Reliable dependency parsing is a notoriously difficult problem in Natural Language Processing (NLP). We describe in this paper a pattern-based approach to dependency parsing that addresses only some syntactic chunks of texts. The chunk isles that we concentrate on are a class of Nominal Phrases (NPs) displaying rather limited recursion, for which reliable chunkers are known to exist. The idea is that NP constituents generally cover a significant part of a sentence, and if their dependency structures are known, we are closer to recovering the whole structure of the sentence. Moreover, NPs, sometimes augmented to prepositional phrases (PPs), often fulfil semantic roles around verbs. The ambiguity of attaching an NP/PP to a verb can be reduced by benefiting from a semantic role parser¹. Therefore, an approach for uncovering the hidden dependency structure of NPs can be combined with other syntactic and semantic methods on the way to building the whole dependency structure of a sentence.

It is worth noting that quite often there is no consensus on what the correct dependency structure for a particular sentence should be. To build a dependency treebank, the human annotators must decide for each word
which is the one it depends on. Determining dependencies often involves a deep interpretation process, and this is why, for the same word sequence, different dependency structures are sometimes negotiated among annotators. In contrast, the decisions that human annotators take while building phrase-structure treebanks usually involve much less ambiguity.

Generally, for treebanks reflecting the syntax of languages that have a free word order (for instance Czech and Romanian), a dependency structure representation is preferred to a constituent structure representation (e.g., the Prague Dependency Treebank in the case of Czech). Following this line, a dependency treebank for the Romanian language was recently built by the Natural Language Processing Group at the Faculty of Computer Science of the „Alexandru Ioan Cuza” University of Iaşi. We have used this resource in the training and evaluation stages of the proposed mechanism.

¹ When lexical-semantic inputs are exploited, the outcome, of course, goes beyond the syntactic level of representation.

We have implemented a system for generating and applying corpus-based patterns in a dependency parsing mechanism. The proposed method works as follows: starting from flat NP sequences and using a set of syntactic patterns extracted from a training corpus, we identify dependency links between the words of the NPs based on their MSD² data. Thus we transform flat NP structures into dependency trees, not necessarily complete. The knowledge encoded in this dependency parser is structured as syntactic patterns defined in terms of morpho-syntactic specifications, as they were developed during the MULTEXT-East Project.

The article is structured as follows. We start by presenting the state of the art in the domain of dependency parsing, focusing on pattern-based approaches. In the following section we present the structure of the corpus used in system training and in the evaluation process. In section 4 we describe the
pattern construction method, and in the last two sections we discuss the obtained evaluation scores and formulate some conclusions.

2 State of the art

In this paper we present a method for the automatic construction of dependency grammar rules that cover NP sequences, based on a collection of pairs (structure, (relation, head, dependent)) extracted from a corpus, where structure is a sequence of MSD tags representing an NP, relation represents a dependency relation, and head and dependent give the positions of the head word and the dependent word in the chunk between which the dependency link should occur. We deterministically associate dependency rules with the syntactic structures of NP sequences by using a matching method based on regular expressions that also take contextual information into consideration. As such, the parsing problem has a lot to do with the pattern-matching problem: each sequence of MSD tags given as input is matched against a structure pattern that has been previously extracted from the corpus. If the matching succeeds, then the relation attached to the pattern is instantiated between the two words of the chunk indicated by the head and dependent indexes. In case of failure, no dependency relation is identified.

In pattern-based approaches, it is important to properly choose the pattern description formalism and the pattern structures. Hearst defines a scheme for patterns that has often been taken as a reference point: patterns are regular expressions with lemmatized word forms as the alphabet and variables corresponding to noun phrases (for example, “NP1 such as NP2” and “NP1 is a/an NP2” are patterns defined in order to encode the hypernymy relation). It has been reported that these patterns are successful at identifying some examples of WordNet relationships, although, when applied to the construction of lexical thesauri, the method performs poorly due to the small number of patterns typically employed.

In the approach we present in this paper we do not rely on lexical information. Our syntactic patterns include only
MSD tags that have been automatically marked in the corpus. We are aware that in many cases of ambiguity adding lexical information should improve performance, and it is therefore an option to be considered in the future. However, lexical information also requires a much larger training corpus, which was not available to us at present.

² Morpho-Syntactic Description, the notation used in the MULTEXT projects.

3 The corpus

The corpus used in this study contains the text of the first chapter of George Orwell's novel “1984”. In order to make it useful as a Romanian Dependency Treebank, three levels of annotation have been added to the raw text and encoded in XML, adopting a simplified form of the XCES standard:
- segmentation and lexical information: sentences have their boundaries marked and each token has its part of speech, lemma and morpho-syntactic information (gender, number, person, case, etc.) attached by running an automatic processing chain that includes sentence segmentation, tokenisation, POS-tagging and lemmatisation;
- noun phrases: NP chunks are automatically marked and manually corrected; although the NPs could be recursive, their complexity is low because they do not include relative clauses;
- dependency data: tokens in sentences have been manually annotated for their head words and the corresponding dependency relations.

As will be shown, these levels of annotation are sufficient to extract surface patterns and to associate them with dependency rules that finally configure a dependency parser. The corpus thus acquired contains 374 sentences, 6778 tokens and 650 NP structures. A different section of the same corpus was used in the evaluation process.

4 Extracting the patterns

The primary data given by the corpus is used to associate dependency structures with each sequence of MSD tags corresponding to an NP extracted from the corpus; these sequences will be called morphological structures in the following. As we have already said,
the corpus used in this study provides three levels of annotation: POS tags, NP chunks and dependency relations. For each NP chunk found in the corpus, we extracted the morphological structure and all the dependency relations marked by the human annotator between the words of the NP chunk. The collection of syntactic patterns thus acquired covers all the syntactic structures found in the corpus.

For example, if we take the following bracket representation for the Romanian noun phrase „un fond de rezervă” (“a reserve fund”):

[NP [Timsr un] [Ncms-n fond] [Spsa de] [Ncfsrn rezervă]]

its morphological structure³ is:

Timsr Ncms-n Spsa Ncfsrn

A sequence of N MSD tags should be described by N-1 patterns, each highlighting one of the N-1 internal heads⁴. Thus, if the word sequence above belongs to the corpus, 3 patterns will be attached to its 4 MSD tags: one attaching „un” to „fond”, one attaching „de” to „fond” and one attaching „rezervă” to „de”. The morphological structures become the syntactic patterns of the dependency parser by means of a generalization process described below. To each such pattern we associate the corresponding dependency information, more precisely the dependency relations and the head and dependent components they involve.

³ See Appendix A for the meaning of the MSD tags in this paper.
⁴ This is because we do not consider discontinuous NPs; in other words, all tokens covered by a pattern belong to the NP.

The structure of the nominal phrase depends on the category of the head word of the group. NPs are generally well individuated on FDG trees as sub-trees rooted in nominal categories (mostly nouns, but pronouns and numerals can also act as NP heads).

We have applied a generalization process to the morphological structures found in the corpus. This was necessary in order to cover similar syntactic constructions for which no occurrences were found in the corpus, but also to reduce the total number of patterns. During the generalization,
syntactic patterns are induced from the extracted morphological structures by collapsing, for the same dependency relation, the contextual information or the MSD tags of the involved dependency elements into regular expressions. For example, a syntactic pattern of the form⁵:

Nc[fm]sry Ncfsoy (Pp3fso-|Afpfson)?

encodes six NP morphological structures:

Ncfsry Ncfsoy Pp3fso-
Ncmsry Ncfsoy Pp3fso-
Ncfsry Ncfsoy Afpfson
Ncmsry Ncfsoy Afpfson
Ncfsry Ncfsoy
Ncmsry Ncfsoy

In order to derive the dependency structure of NP sequences, only the dependency relations occurring internally in NP chunks have been used in the parser rules. The most frequently encountered dependency relations in the training corpus are:
- substantival attribute (marked as a.subst.), linking two nouns, as in „casa plăcerii” (En: “house of pleasure”);
- adjectival attribute (marked as a.adj.), linking an adjective to the noun it modifies, as in „frumoasa fată” (En: “the beautiful girl”);
- determinant (marked as det.), linking an article or a determiner to the noun it modifies, as in „o carte” (En: “a book”);
- coordination (marked as coord.), usually linking conjunctions or punctuation marks to words having identical parts of speech, as in „tânără şi frumoasă” (En: “young and beautiful”).

The main idea of the parser is that words with similar tags appearing in contexts similar to the ones found in the corpus will be linked by the same dependency relations.

The following examples show patterns covering structures with 3 MSD tags:

1. pattern: (structure "Ncmsry Ncmsoy Ncfsry") (rel name a.subst.) (head 1) (dependent 2)
   pattern: (structure "Ncmsry Ncmsoy Ncfsry") (rel name a.subst.) (head 2) (dependent 3)

These patterns apply to the sequence „acoperișul Blocului Victoria” (“the Victoria block's roof”). Both match identical sequences of tokens, whose MSD tags are linked by a.subst. relations: for the first pattern, the relation occurs between the first two noun tags (that is
Ncmsry and Ncmsoy) while, in the second pattern, the same a.subst. relation connects the last two tags (more precisely, Ncmsoy and Ncfsry).

⁵ See Appendix B for a description of the notations used in the regular expressions in this paper.

2. pattern: (structure "Tdfpr Mcfp-l Ncfp-n") (head 2) (dependent 1) (rel name det.)
   pattern: (structure "Tdfpr Mcfp-l Ncfp-n") (head 3) (dependent 2) (rel name a.adj.)

These patterns apply to the sequence „cele Două Minute” (“the Two Minutes”). The patterns treat sequences involving an article, a numeral and a noun.

3. pattern: (structure "(Tsms)? Ncms[or]y Afpms-n") (head 1) (dependent 2) (rel name a.adj.)

A sequence matched by the pattern Tsms Ncmsoy Afpms-n, which is an instance of this pattern including the first, optional, element, is „al Partidului Interior” (“of the Inner Party”), while a sequence matching the pattern without the first element is „pieptul solid” (“the solid breast”). Here, the a.adj. relation connects the tokens having the MSD tags Ncmsoy and Afpms-n, or Ncmsry and Afpms-n. As noted, the pattern has an optional left context for this relation: the tag Tsms. Optional elements (in this and similar notations) are not counted when numbering the head and dependent positions.

4. pattern: (structure "Tsms Ncmsoy Afpms-n") (head 2) (dependent 1) (rel name det.)
   pattern: (structure "Tsms Ncmsoy Afpms-n") (head 2) (dependent 3) (rel name a.adj.)

These patterns match the same sequence „al Partidului Interior” (“of the Inner Party”) as above, but the first pattern now identifies the det. relation connecting the first two tokens (with the head in the second position). The pattern has a right context for this relation: the tag Afpms-n. The second pattern indicates an a.adj. relation between the last two tokens, and this relation has a left context given by the article tag Tsms.

It is of course possible that the corpus reveals more than one relation between identical pairs of MSD tags. In this case the parser will include them all in the result. As mentioned already, the
disambiguation between these cases should be done at a lexical level, which is not treated in this study. The reverse case is when the NP morphological structure is not completely parsed by the dependency model because the rules in the collection do not include the complete combination of tags in the input. Such misses negatively affect recall.

5 Evaluation

We adopted Hajič's formulas in the evaluation of the dependency parser:

• Dependency recall:

$$\text{Recall} = \frac{\mathit{Correct}(D)}{|S|}$$

where Correct(D) is the number of correct dependencies found by the parser (the word is attached to its true head and the relation has the correct label) and |S| is the size of the test data in words (since |dependencies| = |words|).

• Dependency precision:

$$\text{Precision} = \frac{\mathit{Correct}(D)}{\mathit{Generated}(D)}$$

where Generated(D) is the number of dependencies found by the parser.

Because of the small size of the corpus (374 sentences) we decided to apply a 3-fold cross validation, which ensures that all the sentences of the corpus are used for both training and testing. This procedure gave us a training corpus of 282 sentences and a testing corpus of 92 sentences. The results of the parser evaluation are given in Table 1.

Table 1. The parser evaluation scores

Dependency relation       Precision   Recall
ALL                       0.89        0.87
substantival attribute    0.83        0.90
adjectival attribute      0.97        0.88
determinant               0.88        0.97
coordination              1.0         0.33

The evaluation shows a rather high precision (0.83 or above) for the most frequent dependency relations found in NP constructions, but a very low recall for the coordination relation. Initially, seeing the failure to recognize the coordination relation, we feared an error, especially since it connects only adjectives and nouns with a conjunction or punctuation. We then realized that this happens because the generalization rules that we have implemented do not cover many instances of variation in the morpho-syntactic features. There is a difficult trade-off in designing proper generalization rules, because making them too lax could trigger false instances. On the other hand, making
them too strict results in fewer applications and therefore in a drop in recall, as here. We know that this is a critical issue that we will have to address in future research. We expect that more data in the corpus will contribute to improving this evaluation score.

6 Conclusions

We presented here a rule-based syntactic parser for analyzing dependencies inside NPs, whose rules have been automatically inferred from a corpus that resulted from a machine-human collaboration. Usually, pattern-based approaches are developed for fixed-order languages, like English. Still, we have shown that good results can also be obtained for freer-order languages, such as Romanian. We consider that the overall evaluation scores of this rule-based parser show that the method can be applied satisfactorily even if the training corpus is small. Moreover, we are confident that a refinement of the generalization mechanism could bring further improvements.

Acknowledgments

The first author received support for this research from the strategic grant POSDRU/89/1.5/S/61968, Project ID 61986 (2009), co-financed by the European Social Fund within the Sectorial Operational Program Human Resources Development 2007-2013. Our thanks go to Radu Simionescu for realising the automatic annotation of the “1984” corpus with markers for SENTENCE, TOK, POS and NP, to Augusto Perez, who did the manual annotation of the dependency relations, and to the METANET4U project, which supported the creation of the automatic annotation tools.

References

Michael Collins, Three generative, lexicalised models for statistical parsing, In Proceedings of ACL, 1997.
Tomaž Erjavec, MULTEXT-East version 4: Multilingual morphosyntactic specifications, lexicons and corpora, In LREC, 2010.
Jan Hajič, Introduction to natural language processing: Treebanks, treebanking and evaluation, Course notes.
Jan Hajič, Alena Böhmová, Eva Hajičová, and Barbora Vidová-Hladká, The Prague Dependency Treebank: A three-level annotation
scenario, In A. Abeillé, editor, Treebanks: Building and Using Parsed Corpora, Amsterdam: Kluwer, pages 103-127, 2000.
Marti A. Hearst, Automatic acquisition of hyponyms from large text corpora, In Proceedings of COLING-92, pages 539-545, 1992.
Nancy Ide, P. Bonhomme, and L. Romary, XCES: An XML-based encoding standard for linguistic corpora, In Proceedings of the Second International Language Resources and Evaluation Conference, Paris: European Language Resources Association, 2000.
Roman Kurc, Maciej Piasecki, and Stan Szpakowicz, Corpus-based extraction of morpho-syntactic patterns for the automatic acquisition of hypernymy, Intelligent Information Systems, pages 77-90, 2010.
David M. Magerman, Statistical decision-tree models for parsing, In Proceedings of the 33rd ACL, 1995.
Mitchell P. Marcus, B. Santorini, and M. A. Marcinkiewicz, Building a large annotated corpus of English: the Penn Treebank, Computational Linguistics, 19(2):313-330, 1994.
Raymond J. Mooney, Symbolic machine learning for natural language processing, In ACL '99 Tutorial, 1999.
Romanian Academy Institute of Linguistics „Iorgu Iordan – Alexandru Rosetti”, Basic Grammar of Romanian Language (in Romanian), Univers Enciclopedic Gold Publishing Group, 2010.
Cenel Augusto Perez, Casuistry of Romanian functional dependency grammar, In Mihai Alex Moruz, Dan Cristea, Dan Tufiş, Adrian Iftene, Horia Niculai Teodorescu (eds.), Proceedings of the 8th International Conference “Linguistic Resources And Tools For Processing Of The Romanian Language”, pages 19-28, 2012.
Radu Simionescu, Graphical grammar studio as a constraint grammar solution for part of speech tagging, In Mihai Alex Moruz, Dan Cristea, Dan Tufiş, Adrian Iftene, Horia Niculai Teodorescu (eds.), Proceedings of the 8th International Conference “Linguistic Resources And Tools For Processing Of The Romanian Language”, pages 109-118, 2012.
Radu Simionescu, Romanian deep noun phrase chunking using graphical grammar studio, In Mihai Alex Moruz, Dan Cristea, Dan Tufiş, Adrian Iftene,
Horia Niculai Teodorescu (eds.), Proceedings of the 8th International Conference “Linguistic Resources And Tools For Processing Of The Romanian Language”, pages 135-143, 2012.
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng, Learning syntactic patterns for automatic hypernym discovery, In Advances in Neural Information Processing Systems (NIPS 2004), pages 13-18, 2004.

APPENDIX A: GLOSSARY OF NOTATION

The following table gives the notation used in this paper. The meanings follow the MULTEXT-East lexical specifications.

MSD tag    Meaning
Afpfson    Adjective qualifier positive feminine singular oblique -definiteness
Afpfsrn    Adjective qualifier positive feminine singular direct -definiteness
Afpms-n    Adjective qualifier positive masculine singular -definiteness
Ncfsoy     Noun common feminine singular oblique +definiteness
Ncfsrn     Noun common feminine singular direct -definiteness
Ncfsry     Noun common feminine singular direct +definiteness
Ncmsoy     Noun common masculine singular oblique +definiteness
Ncmsry     Noun common masculine singular direct +definiteness
Ncms-n     Noun common masculine singular -definiteness
Pp3fso-    Pronoun personal third feminine singular oblique
Spsa       Adposition preposition simple accusative
Tifsr      Article indefinite feminine singular direct
Timsr      Article indefinite masculine singular direct
Tsms       Article possessive masculine singular

APPENDIX B: GLOSSARY OF PATTERN NOTATIONS

Meta-character   Meaning
[ ]              Matches any one of the characters inside the square brackets, for exactly one character position.
( )              The opening and closing parentheses group parts of the enclosed expression together.
?                Matches the preceding character or group 0 or 1 times only.
|                Logical OR between the expressions on its left-hand and right-hand sides.
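
As a rough illustration of how the notation of Appendix B can drive the matching step described in section 4, the following Python sketch applies a few of the example patterns to a sequence of MSD tags. It is not the implementation used in this study. The names Pattern, align, apply_pattern and parse are hypothetical, the relation labels are written with their abbreviation periods, and we assume that pattern elements are separated by spaces and that a trailing "?" marks a whole element as optional. Following example 3, optional elements are not counted when the head and dependent positions are resolved.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Pattern(NamedTuple):
    """Hypothetical container for one extracted pattern."""
    structure: str   # space-separated elements in the Appendix B notation
    relation: str    # dependency label, e.g. "a.adj."
    head: int        # 1-based position; optional elements are not counted
    dependent: int   # 1-based position; optional elements are not counted


Link = Tuple[str, int, int]  # (relation, head index, dependent index), 0-based


def align(elements: List[str], tags: List[str]) -> Optional[List[Optional[int]]]:
    """Map each pattern element to the index of the tag it matches.

    A skipped optional element maps to None. The result is None when the
    elements cannot cover the whole tag sequence.
    """
    def rec(i: int, j: int) -> Optional[List[Optional[int]]]:
        if i == len(elements):
            return [] if j == len(tags) else None
        # Assumption: a trailing "?" makes the whole element optional.
        optional = elements[i].endswith("?")
        core = elements[i][:-1] if optional else elements[i]
        if j < len(tags) and re.fullmatch(core, tags[j]):
            rest = rec(i + 1, j + 1)
            if rest is not None:
                return [j] + rest
        if optional:
            rest = rec(i + 1, j)
            if rest is not None:
                return [None] + rest
        return None

    return rec(0, 0)


def apply_pattern(pattern: Pattern, tags: List[str]) -> Optional[Link]:
    """Return the link proposed by one pattern, or None if it does not match."""
    elements = pattern.structure.split()
    alignment = align(elements, tags)
    if alignment is None:
        return None
    # Positions are numbered over the mandatory elements only.
    counted = [idx for elem, idx in zip(elements, alignment)
               if not elem.endswith("?")]
    return (pattern.relation,
            counted[pattern.head - 1],
            counted[pattern.dependent - 1])


def parse(tags: List[str], patterns: List[Pattern]) -> List[Link]:
    """Collect the distinct links proposed by all matching patterns."""
    links = set()
    for pattern in patterns:
        link = apply_pattern(pattern, tags)
        if link is not None:
            links.add(link)
    return sorted(links)


# Patterns taken from examples 3 and 4 of section 4.
PATTERNS = [
    Pattern("(Tsms)? Ncms[or]y Afpms-n", "a.adj.", 1, 2),
    Pattern("Tsms Ncmsoy Afpms-n", "det.", 2, 1),
    Pattern("Tsms Ncmsoy Afpms-n", "a.adj.", 2, 3),
]

if __name__ == "__main__":
    # "al Partidului Interior"
    print(parse(["Tsms", "Ncmsoy", "Afpms-n"], PATTERNS))
    # "pieptul solid"
    print(parse(["Ncmsry", "Afpms-n"], PATTERNS))
```

In this sketch a sequence that no pattern covers simply yields an empty list of links, which mirrors the behaviour described for the parser in case of matching failure. Duplicate links proposed by different patterns are merged, while different relations proposed for the same pair of positions are all kept.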